Overview of MVGE.
Top-Left: MVGE consists of a ViT backbone that processes video input frames, followed by a temporal decoder with cross-attention and dynamic NTK scaling RoPE, producing scale-invariant point maps.
Top-Right: Cross-frame geometric consistency enforced across global and local geometric levels (G
1, G
2) to maintain structural coherence across frames.
Bottom-Left: RoPE with dynamic NTK scaling applied to extend sequence context, using frequency scaling that adaptively weights dimensions based on scale factor, and train-time sequence stretching that creates a virtual extended sequence to sample positions.
Bottom-Right: Hierarchical temporal consistency constraints applied multiple temporal strides (δ = 1, 2, 4, 8) to enforce smooth, consistent point map predictions across time.